EXHIBIT A

FUNCTION 

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal
alignments are found by inserting gaps to maximize the number of matches using the local homology
algorithm of Smith and Waterman.

DESCRIPTION 

BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two
sequences, and then displays the alignment in a format similar to the output from Gap. The sequences
can be of very different lengths and have only a small segment of similarity between them. You could
take a short RNA sequence, for example, and run it against a whole mitochondrial genome.

SEARCHING FOR SIMILARITY 

BestFit is the most powerful method in the Wisconsin Sequence Analysis Package™ for identifying the 
best region of similarity between two sequences whose relationship is unknown. 

EXAMPLE 

The sequence gamma.seq contains an Alu family sequence somewhere in the first 500 bases. alu.seq
contains a generic human Alu family repeat. The two sequences are aligned and the best segment of
similarity is found with BestFit.

% bestfit

BESTFIT of what sequence 1 ? gamma.seq

Begin (* 1 *) ?
End (* 11375 *) ? 500
Reverse (* No *) ?

to what sequence 2 (* gamma.seq *) ? alu.seq

Begin (* 1 *) ?
End (* 207 *) ?
Reverse (* No *) ?

What is the gap creation penalty (* 5.00 *) ?

What is the gap extension penalty (* 0.30 *) ?

What should I call the paired output display file (* gamma.pair *) ?

Aligning ...

         Gaps: 3
      Quality: 129.3
        Ratio: 0.625
 % Similarity: 84.466
       Length: 209

%

Here is the output file. Notice how BestFit finds and displays only the best segments of similarity:

BESTFIT of: gamma.seq check: 6474 from: 1 to: 500

Human fetal beta globins G and A gamma

from Shen, Slightom and Smithies, Cell 26; 191-203.

Analyzed by Smithies et al. Cell 26; 345-353.

to: alu.seq check: 4238 from: 1 to: 207

HSREP2 from the EMBL data library

Human Alu repetitive sequence located near the insulin gene
Dhruva B.R., Shenk T., Subramanian K.N.; "Integration in vivo into
Simian virus 40 DNA of a sequence that resembles a certain family of
genomic interspersed repeated sequences"; Proc. Natl. Acad. Sci. USA
77:4514-4518 (1980)

Symbol comparison table: ...Data.Rundata]Swgapdna.Cmp

        Gap Weight: 5.000       Average Match: 1.000
     Length Weight: 0.300    Average Mismatch: -0.900

           Quality: 129.3              Length: 209
             Ratio: 0.625                Gaps: 3
Percent Similarity: 84.466   Percent Identity: 84.466

gamma.seq x alu.seq June 20, 1994 15:15

[The rows of match characters between each pair of sequence lines are not legible in this copy.]

137 AGACCAACCTGGCCAACATGGTGAAATCCCATCTCTAC.AAAAATACAAA 185
  1 AGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTGAAAATACAAA 50

186 AATTAGACAGGCATGATGGCAAGTGCCTGTAATCCCAGCTACTTGGGAGG 235
 51 AATTAG^GGCATGGTGA^GTGCCrG^aATCCCAGCTACTlSBGAGG 100

236 CTGACGAAGGAGAATTGCTTGAACCTGGAAGGCAGGAGTTGCAGTGAGCC 285
101 CTGAGAC*CL^GAATtCCTT^ 149

286 GAGATCATACCACTGCACTCCAGCCTGGGTGACAGAACAAGACTCTGTCT 335
150 GAGATCGCACGGCTGCACTCCAGCCT.GGTGACAGAGCGAGACTCCXTCT 198

336 CAAAAAAAA 344
199 CAAAAAAAA 207

RELATED PROGRAMS

When you want an alignment that covers the whole length of both sequences, use Gap. When you are 
trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a 
multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. 
DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow 
displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average 
similarity of two or more aligned sequences at each position in the alignment. Pretty displays 
alignments of several sequences. LineUp is an editor for editing multiple sequence alignments. 
CompTable helps generate scoring matrices for peptide comparison. 

ALGORITHM 

BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied
Mathematics 2; 482-489 (1981)) to find the best segment of similarity between two sequences. BestFit
reads a scoring matrix that contains values for every possible GCG symbol match (see the LOCAL
DATA FILES topic below). The program uses these values to construct a path matrix that represents
the entire surface of comparison with a score at every position for the best possible alignment to that
point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix
values of the matches in that alignment, less the gap creation penalty times the number of gaps in that
alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap
creation and gap extension penalties are set by you. If the best path to any point has a negative value,
a zero is put in that position.


After the path matrix is complete, the highest value on the surface of comparison represents the end of
the best region of similarity between the sequences. The best path from this highest value backwards
to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the
best segment of similarity between the two sequences.
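
The path matrix construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the program's own code: the function name best_local_quality is hypothetical, a single match value and a single mismatch value stand in for the scoring matrix, ambiguity symbols are ignored, and only the highest value on the surface is returned (the traceback that recovers the alignment itself is left out).

```python
def best_local_quality(seq1, seq2, match=1.0, mismatch=-0.9,
                       gap_creation=5.0, gap_extension=0.3):
    """Return the highest value on the surface of comparison.

    A gap of k symbols costs gap_creation + k * gap_extension, as in the
    quality definition above. Negative path values are replaced by zero.
    """
    neg_inf = float("-inf")
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # best[i][j]: best path value ending at seq1[i-1], seq2[j-1] (never below 0)
    best = [[0.0] * cols for _ in range(rows)]
    # gap1[i][j]: best path ending with a gap in seq1 (seq2[j-1] unpaired)
    gap1 = [[neg_inf] * cols for _ in range(rows)]
    # gap2[i][j]: best path ending with a gap in seq2 (seq1[i-1] unpaired)
    gap2 = [[neg_inf] * cols for _ in range(rows)]
    highest = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            gap1[i][j] = max(best[i][j - 1] - gap_creation - gap_extension,
                             gap1[i][j - 1] - gap_extension)
            gap2[i][j] = max(best[i - 1][j] - gap_creation - gap_extension,
                             gap2[i - 1][j] - gap_extension)
            value = match if seq1[i - 1] == seq2[j - 1] else mismatch
            best[i][j] = max(0.0, best[i - 1][j - 1] + value,
                             gap1[i][j], gap2[i][j])
            highest = max(highest, best[i][j])
    return highest


# Match = 1.0, MisMatch = 0.0, Gap weight = 3.0, Length Weight = 0.0
print(best_local_quality("GACCAT", "GACAT", 1.0, 0.0, 3.0, 0.0))  # 3.0
```

The sketch stores the whole surface, so its memory use grows with the product of the two sequence lengths, which is the same quantity the CONSIDERATIONS topic calls the area of the surface of comparison.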

For nucleic acids, the default scoring matrix has a match value of 1.0 for each identical symbol
comparison and -0.90 for each non-identical comparison (not considering nucleotide ambiguity symbols
for this example). The quality score for a nucleic acid alignment can, therefore, be determined using
the following equation:

Quality = 1.0 x TotalMatches + -0.90 x TotalMismatches
        - (GapCreationPenalty x GapNumber)
        - (GapExtensionPenalty x TotalLengthOfGaps)
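
As a check on the equation, the figures reported for the gamma.seq and alu.seq example can be fed back into it. The sketch below is illustrative only. The function name nucleic_quality is hypothetical, and the match and mismatch counts are not printed by the program: they are inferred here from the reported Length (209), Gaps (3) and Percent Identity (84.466), on the assumption that each of the three gaps is one symbol long.

```python
def nucleic_quality(matches, mismatches, gap_number, total_gap_length,
                    gap_creation=5.0, gap_extension=0.3,
                    match_value=1.0, mismatch_value=-0.9):
    """Quality score for a nucleic acid alignment, per the equation above."""
    return (match_value * matches
            + mismatch_value * mismatches
            - gap_creation * gap_number
            - gap_extension * total_gap_length)


# Inferred from the example output: 209 alignment positions, 3 of them gaps.
aligned_pairs = 209 - 3
matches = round(aligned_pairs * 84.466 / 100)   # 174 identical pairs
mismatches = aligned_pairs - matches            # 32 non-identical pairs

quality = nucleic_quality(matches, mismatches, gap_number=3, total_gap_length=3)
print(round(quality, 1))  # 129.3
```

The result agrees with the Quality of 129.3 shown in the example output file.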

The quality score for a protein alignment is calculated in a similar manner. However, while the default
nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein
scoring matrix has different values for the various non-identical amino acid comparisons. The quality
score for a protein alignment can therefore be determined using the following equation (where Total_AA
is the total number of A-A (Ala-Ala) matches in the alignment, CmpVal_AA is the value for an A-A
comparison in the scoring matrix, Total_AB is the total number of A-B (Ala-Asx) matches in the
alignment, CmpVal_AB is the value for an A-B comparison in the scoring matrix, ...):

Quality = CmpVal_AA x Total_AA
        + CmpVal_AB x Total_AB
        + CmpVal_AC x Total_AC
          .
          .
          .
        + CmpVal_ZZ x Total_ZZ
        - (GapCreationPenalty x GapNumber)
        - (GapExtensionPenalty x TotalLengthOfGaps)
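
The same sum can be written as a lookup over symbol pairs. This is a rough sketch only: the function name protein_quality is hypothetical, the comparison values in cmp_val are made up for illustration and are not taken from swgappep.cmp, and the table is assumed to be symmetric (an A-B comparison scores the same as B-A). The default penalties are the protein defaults listed in the COMMAND-LINE SUMMARY.

```python
def protein_quality(pair_counts, cmp_val, gap_number, total_gap_length,
                    gap_creation=3.0, gap_extension=0.1):
    """Sum CmpVal x Total over every symbol pair, then subtract gap penalties."""
    total = 0.0
    for pair, count in pair_counts.items():
        key = tuple(sorted(pair))        # assumed symmetric: (B, A) -> (A, B)
        total += cmp_val[key] * count
    return total - gap_creation * gap_number - gap_extension * total_gap_length


# Hypothetical comparison values, for illustration only.
cmp_val = {("A", "A"): 1.0, ("A", "B"): 0.25, ("B", "B"): 0.75}
# Number of times each pair occurs in a hypothetical alignment.
pair_counts = {("A", "A"): 4, ("B", "A"): 2, ("B", "B"): 1}

print(protein_quality(pair_counts, cmp_val, gap_number=1, total_gap_length=2))
```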


For a more complete discussion of scoring matrices, see the Data Files manual. 


CONSIDERATIONS 


BestFit Always Finds Something

BestFit always finds an alignment for any two sequences you compare - even if there is no
significant similarity between them! You must evaluate the results critically to decide if the
segment shown is not just a random region of relative similarity.

The Segments Shown Obscure Alternative Segments 

BestFit finds only one segment of similarity, so if there are several, all but one is obscured. You
can approach this problem with graphic matrix analysis (see the Compare and DotPlot
programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found
in earlier runs to bring other segments out of the shadow of the best segment.

The Best Fit is Only One Member of a Family

As with other alignment algorithms of this type, the alignment displayed is a member of the family of
best alignments. This family may have other members of equal quality, but will not have any
member with a higher quality. The family is usually significantly different for different choices
of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for
the Gap program to learn more about this family.

The Surface of Comparison 

The magnitude of the computer's job is proportional to the area of the surface of comparison.
That area is determined by the product of the lengths of the two sequences compared. BestFit
can evaluate a surface of up to 3.5 million elements. This surface would be large enough to
compare two sequences approximately 1,870 symbols long, or one sequence 200 symbols long
with another 17,500 symbols long. If you have much longer sequences, you can use the
command-line option -LIMit to use the surface more efficiently.
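
A quick way to see whether a planned comparison fits is to multiply the two lengths yourself. The sketch below assumes the 3.5 million element limit quoted in this topic; the names MAX_SURFACE, fits_surface and largest_square_side are hypothetical helpers, not part of the program.

```python
import math

MAX_SURFACE = 3_500_000  # limit quoted in this topic


def fits_surface(length1, length2, max_surface=MAX_SURFACE):
    """True if the full surface of comparison is within the limit."""
    return length1 * length2 <= max_surface


def largest_square_side(max_surface=MAX_SURFACE):
    """Longest length L such that an L x L comparison still fits."""
    return math.isqrt(max_surface)


print(fits_surface(200, 17_500))   # True: exactly 3.5 million elements
print(fits_surface(500, 207))      # True: the gamma.seq / alu.seq example
print(largest_square_side())       # 1870
```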

The Public Scoring Matrix for Nucleic Acid Comparisons is Very Stringent 

The scoring matrix swgapdna.cmp penalizes mismatches -0.9, so segments found may be very
short. This penalty means that an alignment cannot be extended by three bases to pick up one
extra match. The scoring matrix used by Smith and Waterman, when local alignments were first
described, used -0.333 for the mismatch penalty. You can use Fetch to copy swgapdna.cmp and
change the mismatch values, or use nwsgapdna.cmp, which has no mismatch penalty.


Rapid Alignment 

When possible, BestFit tries to find the optimal alignment very quickly. If this rapid alignment 
is not unambiguously optimal, BestFit automatically realigns the sequences to calculate the 
optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen
(Aligning ...) is displayed twice for a single alignment.

ALIGNING LONG SEQUENCES 

This program can align very long sequences if you know roughly where the alignment of interest
begins. Run the program with the command-line option -LIMit. Then set the starting coordinates for
each sequence near the point where the alignment of interest begins and set gap shift limits on each
sequence. The program then aligns the sequences from your starting point such that the sequences do
not get out of phase by more than the gap shift limits you have set. If you started both sequences at
base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then
a given base in one sequence could not be aligned with any base outside of a bounded range
(starting at 300 in this example) on the other sequence.

If you do not put -LIMit on the command line, the program automatically sets gap shift limits if they are
required for the alignment of long sequences to proceed. In this case, the program limits the total
length of gaps that can be inserted into each sequence and calculates the best alignment within this
restriction. The program then performs a calculation to determine whether a better alignment could
possibly be found if there were no restriction on the total length of
gaps in each sequence. If the program cannot rule out this possibility, it displays the message
*** Alignment is not guaranteed to be optimal ***. Because the criteria used in this
calculation for guaranteeing an optimal alignment are very stringent, a limited alignment often may be
optimal even if this message is displayed. In any event, the program continues.

EVALUATING ALIGNMENT SIGNIFICANCE 

This program can help you evaluate the significance of the alignment, using a simple statistical
method, with the -RANdomizations command line option. The second sequence is repeatedly shuffled,
maintaining its length and composition, and then realigned to the first sequence. The average
quality score, plus or minus the standard deviation, of all randomized alignments is reported in the
output file. You can compare this average quality score to the quality score of the actual alignment to
help evaluate the significance of the alignment. The number of randomizations can be specified
with the -RANdomizations command line qualifier; the default is 10.

The score of each randomized alignment is reported to the screen. You can use <Ctrl>C to interrupt
the randomizations and output the results from those randomized alignments that have been
completed.

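
The shuffling test can be sketched as follows. This is only an illustration of the idea, under stated assumptions: the function names are hypothetical, and the scorer used here is a simplified stand-in that finds the best ungapped local segment, not the gapped BestFit quality. Any scoring function that takes two sequences can be passed in its place.

```python
import random
import statistics


def ungapped_local_score(seq1, seq2, match=1.0, mismatch=-0.9):
    """Stand-in scorer: best local segment score with no gaps allowed."""
    best = 0.0
    previous = [0.0] * (len(seq2) + 1)
    for x in seq1:
        current = [0.0]
        for j, y in enumerate(seq2, start=1):
            value = match if x == y else mismatch
            current.append(max(0.0, previous[j - 1] + value))
        best = max(best, max(current))
        previous = current
    return best


def randomized_scores(seq1, seq2, score=ungapped_local_score, n=10, seed=None):
    """Shuffle seq2 n times (same length and composition) and score each."""
    rng = random.Random(seed)
    symbols = list(seq2)
    scores = []
    for _ in range(n):
        rng.shuffle(symbols)
        scores.append(score(seq1, "".join(symbols)))
    return statistics.mean(scores), statistics.stdev(scores), scores


seq1, seq2 = "GACCATGACCATTTGA", "GACATTGACC"
actual = ungapped_local_score(seq1, seq2)
average, deviation, _ = randomized_scores(seq1, seq2, n=10, seed=1)
print(f"actual {actual:.1f}, randomized {average:.1f} +/- {deviation:.1f}")
```

An actual score that sits only a little above the randomized average suggests that the segment found may be no better than chance.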
ALIGNMENT METRICS 

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity. 

Quality is the metric maximized in order to align the sequences. Ratio is the
quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the
symbols that actually match. Percent Similarity is the percent of the symbols that are similar.
Symbols that are across from gaps are ignored. A similarity is scored when the scoring matrix value for
a pair of symbols is greater than or equal to 0.50, the similarity threshold. This threshold is also used
to decide when to put a ':' (colon) between two aligned symbols. You can reset
the threshold with the second optional parameter of -PAIr. For instance, the expression
-PAIr=1.0,0.5 would set the similarity threshold to 0.5.
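
The three derived figures can be computed from a pair of aligned strings as sketched below. The function names are hypothetical, '.' is assumed to mark a gap, and dna_value is a stand-in for the nucleic acid scoring matrix (1.0 for identical symbols, -0.9 otherwise). The quality is passed in rather than recomputed.

```python
def alignment_metrics(top, bottom, quality, cmp_value, threshold=0.5, gap="."):
    """Ratio, Percent Identity and Percent Similarity for two aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    # Positions across from gaps are ignored.
    pairs = [(x, y) for x, y in zip(top, bottom) if x != gap and y != gap]
    identical = sum(1 for x, y in pairs if x == y)
    similar = sum(1 for x, y in pairs if cmp_value(x, y) >= threshold)
    shorter = min(len(top) - top.count(gap), len(bottom) - bottom.count(gap))
    return {
        "ratio": quality / shorter,
        "percent_identity": 100.0 * identical / len(pairs),
        "percent_similarity": 100.0 * similar / len(pairs),
    }


def dna_value(x, y):
    """Stand-in for the default nucleic acid scoring matrix."""
    return 1.0 if x == y else -0.9


# First eight columns of the example alignment: 7 matches, 1 mismatch.
print(alignment_metrics("AGACCAAC", "AGACCAGC", quality=6.1, cmp_value=dna_value))
```

With the default nucleic acid values a mismatch never reaches the 0.50 threshold, so Percent Similarity and Percent Identity are equal, as they are in the example output (84.466 for both). The same denominators reproduce the example figures: 174 identical pairs out of 206 gives 84.466, and 129.3 divided by 207 gives 0.625.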

PEPTIDE SEQUENCES 

If your input sequences are peptide sequences, this program uses a scoring matrix with matches scored
as 1.5 and mismatches scored according to the evolutionary distance between the amino acids as
measured by Dayhoff and normalized by Gribskov (Gribskov and Burgess Nucl. Acids Res. 14(16);
6745-6763 (1986)).

Input sequences may not be more than 30,000 symbols long. This program cannot evaluate a surface of
comparison larger than 5.5 million elements (a 200 x 27,500 or a
2,300 x 2,300 comparison). See the ALIGNING LONG SEQUENCES topic above to align
sequences that would normally exceed this limit. You can ask your system
manager to increase the maximum surface of comparison if your system has enough virtual memory.

SEQUENCE TYPE 

The function of BestFit depends on whether your input sequence(s) are protein or nucleotide. Normally
the type of a sequence is determined by the presence of either Type: N or Type: P on the last line of the
text heading just above the sequence. If your sequence(s) are not the correct type, turn to
Appendix VI for information on how to change or set the type of a sequence.

COMMAND-LINE SUMMARY 

All parameters for this program may be put on the command line. Use the option -CHEck to see the
summary below and to have a chance to add things to the command line before the program executes.
In the summary below, the capitalized letters in the qualifier names are the letters you must type
in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are
optional.

Minimal Syntax: % bestfit [-INfile1=]gamma.seq [-INfile2=]alu.seq -Default

Prompted Parameters:

-BEGin1=1 -BEGin2=1         beginning of each sequence
-END1=500 -END2=207         end of each sequence
-NOREV1 -NOREV2             strand of each sequence
-GAPweight=5.0              gap creation penalty (3.0 is protein default)
-LENgthweight=0.3           gap extension penalty (0.1 is protein default)
[-OUTfile1=]gamma.pair      output file for alignment

Local Data Files:

-DATa=swgapdna.cmp          scoring matrix for nucleic acids
-DATa=swgappep.cmp          scoring matrix for peptides

Optional Parameters:

-LIMit1=499 -LIMit2=206     limit the surface of comparison
-RANdomizations[=10]        determine average score from 10 randomized
                            alignments
-PAIr=1.0,0.5,0.1           thresholds for displaying '|', ':', and '.'
-WIDth=50                   the number of sequence symbols per line
-PAGe=60                    adds a line with a form feed every 60 lines
-NOBIGGaps                  suppresses abbreviation of large gaps with '.'s
-HIGhroad                   makes the top alignment for your parameters
-LOWroad                    makes the bottom alignment for your parameters
-NOSUMmary                  suppresses the screen summary

LOCAL DATA FILES


The files described below supply auxiliary data to this program. The program automatically reads 
them from a public data directory unless you either 1) have a data file with exactly the same name in 
your current working directory, or 2) name a file on the command line with an expression like 
-DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

If the first sequence you name is a nucleic acid, BestFit uses the scoring matrix in the public file 
swgapdna.cmp. (SW stands for Smith and Waterman.) If the first sequence you name is a peptide 
sequence, BestFit reads swgappep.cmp instead. The presence of these files in your current working 
directory causes BestFit to read your version instead. (See the Data Files manual for more 
information about scoring matrices.) 


OPTIONAL PARAMETERS 


The parameters and switches listed below can be set from the command line. For more information, 
see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide. 

-LIMit1=20 and -LIMit2=20

let you set gap shift limits for each sequence. When you already know of a long similarity
between two sequences you can "zip" them together using this mode. The beginning coordinates
for each sequence must be near the beginning of the alignment you want to see. The alignment
continues so that gaps inserted do not require the sequences to get out of step by more than the
gap shift limits. You can align very long sequences rapidly. The surface of comparison is still
limited to 3.5 million. The size of a comparison can be predicted by multiplying the average
length of the two sequences by the sum of the two shift limits.

If you add -LIMit to the command line without any qualifier value, the program prompts you to
enter gap shift limits for each sequence.
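
The size prediction given above is simple arithmetic. In the sketch below, limited_surface is a hypothetical helper name and the 3.5 million figure is the limit quoted in this topic.

```python
def limited_surface(length1, length2, limit1, limit2):
    """Predicted comparison size: average length times the sum of the shift limits."""
    average_length = (length1 + length2) / 2
    return average_length * (limit1 + limit2)


# Two 30,000-symbol sequences with -LIMit1=20 and -LIMit2=20
size = limited_surface(30_000, 30_000, 20, 20)
print(size)                 # 1200000.0
print(size <= 3_500_000)    # True
```

Without limits the same pair would need 30,000 x 30,000, or 900 million, elements.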

-RANdomizations=10

reports the average alignment score and standard deviation from 10 randomized alignments in
which the second sequence is repeatedly shuffled, maintaining the length and composition of the
original sequence, and then aligned to the first sequence. You can use the optional parameter to
set the number of randomized alignments to some number other than 10.

-OUTfile2=seqname1.gap -OUTfile3=seqname2.gap

This program can write three different output files. The first displays the alignment of sequence
one with sequence two. The second is a new sequence file for sequence one, possibly expanded by
gaps to make it align with sequence two. The third, like the second, is a new sequence file for
sequence two, possibly expanded by gaps to make it align with sequence one. The program
writes only the first file unless there are output file options on the command line. If there are
any output files named on the command line, only those output files are written. If you add
-OUT to the command line without any qualifying filename, then the program will write the
second and third output files after prompting you for their names.

Aligned sequences (in sequence files) can be displayed with GapShow. Their similarity can be
displayed with PlotSimilarity.

-PAIr=1.0,0.5,0.1 


The paired output file from this program displays sequence similarity by printing one of three
characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.).
Normally a pipe character is put between symbols that are the same, a colon is put between
symbols whose comparison value is greater than or equal to 0.50, and a period is put between
symbols whose comparison value is greater than or equal to 0.10. You can change these match
display thresholds from the command line. The three parameters for -PAIr are the display
thresholds for the pipe character, colon, and period. The match display criterion for a pipe
character changes from symbolic identity (the default) to the quantitative threshold you have set
in the first parameter. A pipe character will no longer be inserted between identical symbols
unless their comparison values are greater than or equal to this threshold. If you still want a
pipe character to connect identical symbols, use x instead of a number as the first parameter.
(See the Data Files manual for more information about scoring matrices.)
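
The display rule can be summarized in a few lines. This sketch reflects one reading of the description above: the function name pair_character is hypothetical, "x" as the first threshold means "use symbolic identity", and a numeric first threshold replaces the identity test entirely. The comparison values in the calls are made up for illustration.

```python
def pair_character(symbol1, symbol2, value, pipe="x", colon=0.5, period=0.1):
    """Choose the character printed between two aligned symbols.

    value is the scoring matrix entry for the pair. With pipe="x" a pipe
    is printed for identical symbols; with a number, a pipe is printed
    only when value reaches that threshold.
    """
    if pipe == "x":
        is_pipe = symbol1 == symbol2
    else:
        is_pipe = value >= pipe
    if is_pipe:
        return "|"
    if value >= colon:
        return ":"
    if value >= period:
        return "."
    return " "


print(pair_character("A", "A", 1.0))            # |
print(pair_character("I", "V", 0.6))            # :
print(pair_character("A", "S", 0.2))            # .
print(pair_character("A", "A", 1.0, pipe=1.5))  # :
```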

-PAGe=64 


When you print the output from this program, it may cross from one page to another in a 
frustrating way - especially when you print on individual sheets. This option adds form feeds to 
the output file in order to try to keep clusters of related information together. You can set the 
number of lines per page by supplying a number after the -PAGe qualifier. 

-WIDth=50


puts 50 sequence symbols on each line of the output file. You can set the width to anything from 
10 to 150 symbols. 


-NOBIGGaps


suppresses large gap abbreviations, showing all the sequence characters across from large gaps. 
Usually, gaps that extend one sequence by more than one complete line of output are abbreviated 
with three dots arranged in a vertical line. 

-LOWroad and -HIGhroad 


The insertion of gaps is, in many cases, arbitrary, and equally optimal alignments can be 
generated by inserting gaps differently. When equally optimal alignments are possible, this 
program can insert the gaps differently if you select either the -LOWroad or the -HIGhroad 
options. Here are examples for the alignment of GACCAT with GACAT with different
parameters.


For:  Match      = 1.0     MisMatch      = -0.9
      Gap weight = 1.0     Length Weight =  0.0

For:  Match      = 1.0     MisMatch      =  0.0
      Gap weight = 3.0     Length Weight =  0.0

HighRoad:

1 GACCAT 6
  |||
1 GACAT. 5

Quality = 3.0

LowRoad:

1 GACCAT 6
     |||
1 .GACAT 5

Quality = 3.0

The HighRoad alignment shifts all of the gaps in sequence two to the left and all of the
gaps in sequence one to the right. The LowRoad alignment does exactly the opposite.


-SUMmary 


writes a summary of the program's work to the screen when you've used the -Default qualifier to
suppress all program interaction. A summary typically displays at the end of a program run
interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

Use this qualifier also to include a summary of the program's work in the log file for a program run in
batch.


Printed: July 13, 1995 08:19 (1162)


